Dillon Engineering, Inc.

This presentation outlines an architecture for efficient Ultra Long FFTs for use in FPGAs
and ASICs. An analysis of accuracy, performance, cost, and power consumption is
presented.

FFTs are at the heart of many real-time signal processing applications, and Ultra Long
FFTs are often used for frequency analysis and communications applications. As
processing requirements increase, FPGAs and ASICs become the logical
choice for implementing real-time FFTs.

This presentation describes an efficient framework for implementing the Cooley-Tukey
algorithm for Ultra Long FFTs using minimal external memory. Typically, for lengths
over 16K, the memory resources of the FPGA or ASIC are exhausted and external
memory is required. The architecture uses two shorter FFTs
(of lengths N1 and N2) to calculate an FFT of length N = N1 x N2. It
is optimized for continuous data FFTs, minimizing the external memory requirements
and offering enough flexibility to be used in many different applications.

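The (N1 x N2)-point FFT can be computed as

$$X[k_1 N_2 + k_2] = \sum_{n_1=0}^{N_1-1} \left[ e^{-j\frac{2\pi n_1 k_2}{N}} \left( \sum_{n_2=0}^{N_2-1} x[n_2 N_1 + n_1]\, e^{-j\frac{2\pi n_2 k_2}{N_2}} \right) \right] e^{-j\frac{2\pi n_1 k_1}{N_1}}.$$

Computing this for $0 \le k_1 \le N_1 - 1$ and $0 \le k_2 \le N_2 - 1$ results in

$$X[k] = \sum_{n=0}^{N-1} x[n]\, e^{-j\frac{2\pi n k}{N}} \quad \text{for } 0 \le k \le N - 1,$$

as desired. This leads to the following high-level architecture: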

The input data is re-ordered by performing the equivalent of a matrix transposition. The
next step is to compute N1 FFTs, each of length N2, followed by the second matrix
transposition. The re-ordered data set is multiplied by the twiddle factors, and N2
FFTs, each of length N1, are computed. The final step is to perform the third matrix
transposition so the output data is in the correct order.
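
The sequence of steps above can be sketched in software. The following is a minimal pure-Python illustration, not the hardware design: `dft` is a direct O(n^2) transform standing in for the short FFT engines, `long_fft` is a hypothetical name, and the negative exponent sign is the conventional DFT choice, assumed here.

```python
import cmath

def dft(seq):
    """Direct O(n^2) DFT; a stand-in for one of the short FFT engines."""
    n = len(seq)
    return [sum(v * cmath.exp(-2j * cmath.pi * i * k / n) for i, v in enumerate(seq))
            for k in range(n)]

def long_fft(x, n1, n2):
    """N = n1 * n2 point DFT built from n1 length-n2 and n2 length-n1 DFTs."""
    n = n1 * n2
    if len(x) != n:
        raise ValueError("len(x) must equal n1 * n2")
    # First transpose: row i1 collects x[i2*n1 + i1] for i2 = 0..n2-1.
    rows = [[x[i2 * n1 + i1] for i2 in range(n2)] for i1 in range(n1)]
    # n1 transforms of length n2, giving rows[i1][k2].
    rows = [dft(row) for row in rows]
    # Second transpose: cols[k2][i1].
    cols = [[rows[i1][k2] for i1 in range(n1)] for k2 in range(n2)]
    # Twiddle multiply by exp(-j*2*pi*i1*k2/n).
    cols = [[v * cmath.exp(-2j * cmath.pi * i1 * k2 / n) for i1, v in enumerate(col)]
            for k2, col in enumerate(cols)]
    # n2 transforms of length n1, giving cols[k2][k1].
    cols = [dft(col) for col in cols]
    # Third transpose: natural-order output index k = k1*n2 + k2.
    return [cols[k2][k1] for k1 in range(n1) for k2 in range(n2)]
```

For any factorization of the length, `long_fft(x, n1, n2)` should agree with `dft(x)` to within floating-point rounding.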
[Figure: high-level architecture block diagram with three Matrix Transpose stages]


The external memory requirements can be reduced by using QDR synchronous SRAM
and an addressing sequence to allow a single bank of memory to be used for each matrix
transposition. This presentation describes details of this addressing sequence. SRAM is a
requirement because data needs to be read and written on every clock cycle and the
addresses are usually not consecutive. QDR allows writing and reading different locations
simultaneously, thereby removing the requirement for two banks of memory at each
matrix transposition.

The potential data growth in longer FFTs makes numerical analysis a necessity. A
finite word length analysis will be presented for both fixed- and floating-point FFTs. It
shows that either floating-point or fairly wide fixed-point FFTs are required to
maintain the precision needed for most applications. These wider word lengths affect
the memory architecture because they require more memory bandwidth for
the matrix transpositions. The trade-offs between word length requirements and memory
architecture are discussed in the presentation.

This architecture can also function as a 2D FFT by simply bypassing the twiddle multiply
and removing the first Matrix Transpose.

A variable-length FFT engine can be built from the same architecture by using variable-length
N1 and N2 FFTs and modifying the Matrix Transpose blocks. Run-time
length selection is often desired so that the resolution can be adjusted.

Accuracy, component cost, and power consumption data will be presented for a system
implemented in a single FPGA and three QDR SRAM ICs computing 512K FFTs on
continuous data at 200 MSPS.




An Efficient Architecture for Ultra Long FFTs in FPGAs and ASICs


Agenda 


■ Architecture optimized for Fast Ultra Long FFTs

Parallel FFT structure reduces external memory bandwidth requirements

■ Lengths from 32K to 256M

Optimized for continuous data FFTs

■ Architecture reduces the algorithm to two smaller manageable FFT engines
Ultra Long FFTs


■ An Ultra Long FFT is one whose length exceeds the internal memory available in the FPGA or ASIC.

System cost can also be reduced for moderate-length FFTs in designs where the FPGA/ASIC size is driven by the memory requirements.

■ This architecture puts most of the storage for the FFT off chip in relatively inexpensive SRAM, reducing the system cost.

Ultra Long FFTs have a structure similar to 2D FFTs.

Cooley-Tukey algorithm

Minimizes external memory IC count and bandwidth
The following shows the execution unit (logic) and memory
requirements for continuous data FFTs of two lengths:

                1K      1M
Butterflies     10      20
Memory          2K      2M

■ The logic requirement for a 1M FFT is only double that of a 1K FFT, while the memory requirement is about 1000 times larger.

Logic for a 1M FFT easily fits into a large FPGA.

Memory requirements exceed what is available even in a large FPGA.
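
The table figures are consistent with one butterfly per radix-2 rank (log2 N) and roughly 2N words of storage. That reading is an assumption, since the source does not state the formulas; under it, the scaling can be sketched with a hypothetical helper:

```python
def fft_resources(n):
    """Estimate (butterflies, memory_words) for a continuous-data radix-2 FFT.

    Assumption (not stated in the source): one butterfly per radix-2 rank
    and about 2*n words of storage, which reproduces the table's figures.
    """
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two >= 2")
    butterflies = n.bit_length() - 1   # log2(n) ranks
    memory_words = 2 * n
    return butterflies, memory_words
```

Under this model, logic grows with log2(N) while memory grows linearly with N, which is why the memory, not the logic, forces the move to external SRAM.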
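Computing N = N1 x N2

The N1 x N2 FFT can be computed as:

$$X[k_1 N_2 + k_2] = \sum_{n_1=0}^{N_1-1} \left[ e^{-j\frac{2\pi n_1 k_2}{N}} \left( \sum_{n_2=0}^{N_2-1} x[n_2 N_1 + n_1]\, e^{-j\frac{2\pi n_2 k_2}{N_2}} \right) \right] e^{-j\frac{2\pi n_1 k_1}{N_1}}$$

Computing this for:

$$0 \le k_1 \le N_1 - 1 \quad \text{and} \quad 0 \le k_2 \le N_2 - 1$$

Results in:

$$X[k] = \sum_{n=0}^{N-1} x[n]\, e^{-j\frac{2\pi n k}{N}} \quad \text{for } 0 \le k \le N - 1,$$

as desired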
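
The step from the nested sums to the single N-point sum rests on the index maps n = n2 N1 + n1 and k = k1 N2 + k2. A short derivation follows, assuming the conventional negative-exponent DFT sign, since the sign is not legible in the extracted slide:

```latex
\begin{aligned}
\frac{nk}{N} &= \frac{(n_2 N_1 + n_1)(k_1 N_2 + k_2)}{N_1 N_2}
  = n_2 k_1 + \frac{n_2 k_2}{N_2} + \frac{n_1 k_1}{N_1} + \frac{n_1 k_2}{N}, \\
e^{-j 2\pi nk/N} &= \underbrace{e^{-j 2\pi n_2 k_1}}_{=\,1}\;
  e^{-j 2\pi n_2 k_2/N_2}\; e^{-j 2\pi n_1 k_2/N}\; e^{-j 2\pi n_1 k_1/N_1}.
\end{aligned}
```

The three remaining factors are the kernel of the length-N2 FFT, the twiddle factor, and the kernel of the length-N1 FFT, respectively.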
N = N1 x N2 Architecture

[Figure: block diagram of the N = N1 x N2 architecture with three Matrix Transpose stages]

■ Three banks of external QDR Memory (single copy each)

■ Two continuous data FFTs (N1, N2) inside FPGA

■ Twiddle Multiply provides vector rotation between N2 and N1 FFTs.

Final matrix transpose for normal order output.
QDR SRAM


Simultaneous reads/writes (separate address/data buses) allow a single bank of memory per matrix transpose.

DDR-style I/O: dual clock edge transfers with the FPGA result in a narrower data path.

Single copy can be kept at each stage while maintaining continuous data flow.

Special address sequence employed so data isn't overwritten in continuous data applications. Reduces IC count.

■ QDR with Virtex II Pro I/O up to 150MHz (read/write)

■ QDR II with Virtex II Pro I/O up to 200MHz (read/write)

CORDIC for Twiddle Factor Generation

Almost N/2 twiddle factors required.

Storing them would need a very large ROM in the FPGA or ASIC.

CORDIC is a natural fit; use the coordinate product as input.

[Figure: Complex Multiply block]

CORDIC produces the sin/cos terms for the angle input.
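
As an illustration of generating twiddle factors on the fly instead of storing them, here is a floating-point sketch of rotation-mode CORDIC driven by the index product n1*k2. The function names, the iteration count, and the quadrant folding are assumptions made for this example; a hardware version would use fixed-point shifts and adds.

```python
import math

ITERATIONS = 32  # hypothetical precision choice, not taken from the source
_ANGLES = [math.atan(2.0 ** -i) for i in range(ITERATIONS)]
_GAIN = 1.0
for _i in range(ITERATIONS):
    _GAIN /= math.sqrt(1.0 + 2.0 ** (-2 * _i))  # pre-compensate CORDIC gain

def cordic_cos_sin(theta):
    """Rotation-mode CORDIC: (cos, sin) for theta in [-pi/2, pi/2]."""
    x, y, z = _GAIN, 0.0, theta
    for i in range(ITERATIONS):
        d = 1.0 if z >= 0.0 else -1.0          # rotate toward zero residual
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * _ANGLES[i]
    return x, y

def twiddle(n1, k2, n):
    """Twiddle factor exp(-j*2*pi*n1*k2/n) from the index product n1*k2."""
    theta = -2.0 * math.pi * ((n1 * k2) % n) / n   # in (-2*pi, 0]
    sign = 1.0
    if theta < -1.5 * math.pi:       # add a full turn, result unchanged
        theta += 2.0 * math.pi
    elif theta < -0.5 * math.pi:     # add half a turn, then negate the result
        theta += math.pi
        sign = -1.0
    c, s = cordic_cos_sin(theta)
    return complex(sign * c, sign * s)
```

Each iteration uses only a shift-like scaling by a power of two and an add, which is what makes the method attractive in logic compared with a ROM holding on the order of N/2 entries.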
Matrix Transpose Address Sequence

Allows single copy for each matrix transpose.

Operates on continuous data, one point read/written per clock cycle.

Reduces memory IC count.

Simple logic for sequence generation.

Works for square or rectangular matrices.

Sequence repeats after log2(N) sets.

Write always follows read.

Simple N = N1 x N2 = 8 example:

1st   2nd   3rd   1st
 0     0     0     0
 1     2     4     1
 2     4     1     2
 3     6     5     3
 4     1     2     4
 5     3     6     5
 6     5     3     6
 7     7     7     7

First and last matrix transposes go left to right in the table, the second right to left.
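
The N = 8 table is reproduced by multiplying each address by a fixed factor modulo N - 1 from one data set to the next, with the last address held fixed. That closed form is inferred from the table rather than given in the source, so the sketch below should be read as an assumption; the function names are hypothetical. It also models the read-then-write behavior of a single memory bank.

```python
def transpose_address_sets(n, multiplier, sets):
    """Address order for each successive data set sharing one memory bank.

    Set p visits address (i * multiplier**p) mod (n - 1) at clock i, with the
    last address n - 1 fixed. This closed form is an assumption inferred from
    the N = 8 table, not a formula given in the source.
    """
    result, m = [], 1
    for _ in range(sets):
        result.append([i if i == n - 1 else (i * m) % (n - 1) for i in range(n)])
        m = (m * multiplier) % (n - 1)
    return result

def stream_through_bank(frames, n1, n2):
    """Read-then-write one point per clock through a single memory bank."""
    n = n1 * n2
    memory = [None] * n
    outputs = []
    for addresses, frame in zip(transpose_address_sets(n, n1, len(frames)), frames):
        out = []
        for i, a in enumerate(addresses):
            out.append(memory[a])   # read the previous set's point first
            memory[a] = frame[i]    # then overwrite it with the new point
        outputs.append(out)
    return outputs
```

With `n = 8` and `multiplier = 2`, the generated sets match the table columns read left to right. Using the inverse multiplier (4 in this example) walks the columns right to left, which corresponds to the direction noted for the second transpose.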
Numbers in a radix-2 FFT can grow by log2(N) bits, or 1 bit per butterfly rank.

■ A 1M FFT can have 20 bits of growth. With 16-bit inputs, results would be 36 bits.

Scaling is always required in fixed-point versions.

Fixed-point scaling should be limited to every other rank, so 10 times for a 1M FFT, producing 26-bit results from 16-bit input.

A floating-point FFT maintains precision without overflowing.

A floating-point FFT uses approximately 8 times the logic of a similar-precision fixed-point version.
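
The word-length arithmetic above can be checked with a small helper. This is a simplified model under two assumptions: one bit of growth per radix-2 rank, and each scaling step removing exactly one bit. The name `output_width` is hypothetical.

```python
def output_width(input_bits, n, scale_every=None):
    """Result width in bits for a radix-2 FFT of power-of-two length n.

    scale_every=None means no scaling; scale_every=2 means one scaling
    step on every other rank. Assumes each scaling removes exactly one bit.
    """
    ranks = n.bit_length() - 1                     # log2(n) butterfly ranks
    scalings = 0 if scale_every is None else ranks // scale_every
    return input_bits + ranks - scalings
```

For a 1M FFT with 16-bit input, the unscaled case gives 16 + 20 = 36 bits, and scaling on every other rank gives 16 + 20 - 10 = 26 bits, matching the figures on the slide.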


Virtex II Pro Performance - 512K FFT

80MHz Continuous Data

■ 1K FFT Engine - 4 butterflies

■ 512 FFT Engine - 4 butterflies

■ FFT Engines at 160MHz

■ QDR memory at 80MHz

Real 14 bit input, complex 24 bit output

■ Virtex II Pro - Device Usage

■ Slices - 12,500

■ BlockRAM - 144

■ MULT18x18 - 88

■ Fits in XC2VP40
■ 2D FFT - Remove first matrix transpose and twiddle multiply.

■ Variable Length - Use variable length FFTs and dynamic matrix transpose blocks.

■ Mixed Radix FFTs - Substitute a radix other than radix-2 for the 2nd FFT.

Performance increases are easy with parallel-input radix-2 FFTs and multiple paths to SRAM.
ParaCore Architect (parameterized core builder)

■ DSP Algorithms

■ Mixed radix FFTs

■ 2D FFTs for image processing

■ Fixed or floating-point FFTs

■ Floating point math library

■ AES Cryptography

■ System level DSP

■ OFDM Transceivers

■ Radar Processing on single FPGA

■ Image Compression/Processing

■ Hardware/Software SOC

■ High speed Ethernet Appliances

■ Linux Based SOC in FPGA

■ MicroBlaze application
■ Architecture optimized for Fast Ultra Long FFTs

Parallel FFT structure reduces external memory bandwidth requirements

■ Lengths from 32K to 256M

Optimized for continuous data FFTs

■ Architecture reduces the algorithm to two smaller manageable FFT engines

Key Features

■ Uses 2 short manageable FFT engines (N = N1 x N2)

■ QDR SRAM reduces IC count and allows simultaneous read/write

■ CORDIC to generate rotation twiddle factors

■ Matrix transpose address sequence

■ Structure similar to 2D FFT or mixed radix FFT